Introduction

In this blog post I scrape reviews of several books on Amazon and preprocess the data.

Loading the libraries

library(polite)
library(rvest)
library(tidyverse)
library(stringr)
library(quanteda)
library(tidyr)
library(RColorBrewer)
library(quanteda.textplots)
library(wordcloud)
library(wordcloud2)
library(devtools)
library(quanteda.dictionaries)
library(quanteda.sentiment)

Attaching package: 'quanteda.sentiment'
The following object is masked from 'package:quanteda': data_dictionary_LSD2015

knitr::opts_chunk$set(echo = TRUE)
The code below checks whether the website permits scraping:
bow("https://www.amazon.com/")
<polite session> https://www.amazon.com/
User-agent: polite R package
robots.txt: 149 rules are defined for 2 bots
Crawl delay: 5 sec
The path is scrapable for this user-agent
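For reference, a polite session can also perform the requests themselves, so the 5-second crawl delay reported above is honoured automatically. This is a minimal sketch of that workflow, not what this post does (the scraper below calls read_html() directly); the ASIN is one of those scraped later:

session <- bow("https://www.amazon.com/product-reviews/B0001DBI1Q/")
page <- scrape(session) # fetch through the session, respecting robots.txt and the crawl delay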
Acquiring the data
The function below scrapes reviews from Amazon, collecting each review's title, text, and star rating. I collected reviews for the following books:
A Game of Thrones: A Song of Ice and Fire, Book 1
A Clash of Kings: A Song of Ice and Fire, Book 2
A Storm of Swords: A Song of Ice and Fire, Book 3
A Feast for Crows: A Song of Ice and Fire, Book 4
A Dance with Dragons: A Song of Ice and Fire, Book 5
Twilight: The Twilight Saga, Book 1
New Moon: The Twilight Saga, Book 2
Eclipse: The Twilight Saga, Book 3
Breaking Dawn: The Twilight Saga, Book 4
The Hunger Games
Catching Fire: The Hunger Games
Mockingjay: The Hunger Games, Book 3
scrape_amazon <- function(ASIN, page_num) {
  url_reviews <- paste0("https://www.amazon.com/product-reviews/", ASIN, "/?pageNumber=", page_num)
  doc <- read_html(url_reviews) # Assign results to `doc`

  # Review title
  doc %>%
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
    html_text() -> review_title

  # Review text
  doc %>%
    html_nodes("[class='a-size-base review-text review-text-content']") %>%
    html_text() -> review_text

  # Number of stars in review
  doc %>%
    html_nodes("[data-hook='review-star-rating']") %>%
    html_text() -> review_star

  # Return a tibble
  tibble(review_title, review_text, review_star, page = page_num, ASIN)
}
Using this function, I scraped an equal number of reviews for each series so they can be compared, looping over pages with a two-second sleep to avoid bot detection, and then wrote the combined data to CSV.
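The scraping chunks are commented out in the post's source so they do not re-run on render; the loop for the first ASIN (B0001DBI1Q) looks like this, with the same pattern repeated for the other eleven ASINs:

out <- scrape_amazon("B0001DBI1Q", 1)
for (i in 2:400) {
  out <- bind_rows(out, scrape_amazon("B0001DBI1Q", i))
  if ((i %% 3) == 0) {
    Sys.sleep(2) # Take an additional two-second break every third page
  }
}
# ...after all twelve loops:
write.csv(out, "amazonreview.csv")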
Reading the data
reviews <- read_csv("amazonreview.csv")
New names:
• `` -> `...1`
Rows: 46450 Columns: 6
── Column specification ────────────────────────────────────────────────
Delimiter: ","
chr (4): review_title, review_text, review_star, ASIN
dbl (2): ...1, page
reviews
Data Preprocessing
I cleaned the review text with the stringr library: URLs, mentions, punctuation, digits, UTF symbol strings such as "<U+0001F9F5>", and newline characters are removed, "&" is replaced with "and", and the text is lowercased and whitespace-squished. Stopwords are then removed, drawn from the stopwords-iso and SMART collections. Stemming is avoided here because the meaning of each word matters for the analysis. Finally, the data is categorized by book title and series title, and the rest of the project compares sentiments and topics across those series.
clean_text <- function(text) {
  # Remove URLs
  str_remove_all(text, " ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)") %>%
    # Remove mentions
    str_remove_all("@[[:alnum:]_]*") %>%
    # Replace "&" character reference with "and"
    str_replace_all("&", "and") %>%
    # Remove strings like "<U+0001F9F5>" -- this must run before the
    # punctuation strip below, otherwise the angle brackets are already
    # gone and the pattern never matches
    str_remove_all("<.*?>") %>%
    # Remove punctuation
    str_remove_all("[[:punct:]]") %>%
    # Remove digits
    str_remove_all("[[:digit:]]") %>%
    # Replace any newline characters with a space
    str_replace_all("\\\n|\\\r", " ") %>%
    # Make everything lowercase
    str_to_lower() %>%
    # Remove surrounding whitespace and collapse repeated spaces
    str_squish()
}
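The rendered page skips from the cleaning function to the co-occurrence matrix below. Per the post's source, the intervening steps apply the cleaner, drop empty rows, build a corpus, tokenize, strip stopwords, and trim the document-feature matrix before computing the fcm; roughly as follows (stopwords(source = "smart") is the current quanteda spelling of the source's stopwords("SMART")):

# Clean the review text and drop rows where nothing survives
reviews$clean_text <- clean_text(reviews$review_text)
reviews <- reviews %>% drop_na(clean_text)

# Corpus -> tokens, removing leftover contraction fragments and stopwords
text <- corpus(reviews$clean_text)
text <- tokens(text, remove_punct = TRUE, remove_numbers = TRUE,
               remove_separators = TRUE, remove_symbols = TRUE) %>%
  tokens_select(pattern = c(stopwords("en"), "im", "didnt", "couldnt", "wasnt",
                            "id", "ive", "isnt", "dont", "wont", "shes", "doesnt"),
                selection = "remove") %>%
  tokens_select(pattern = stopwords(source = "smart"), selection = "remove")

# Document-feature matrix, trimmed to frequent terms, then the fcm
text <- dfm(text)
text_dfm <- dfm_trim(text, min_termfreq = 50, docfreq_type = "prop")
text_fcm <- fcm(text_dfm)
text_fcm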
Feature co-occurrence matrix of: 3,454 by 3,454 features.
            features
features      love fantasy kid stories  set creative worlds groups characters fighting
  love       13025    2234 337    1701 1155      224    161     72      11380      579
  fantasy        0    1779  45     554  456       76     99     20       6374      149
  kid            0       0  48      23   27        6      5      5        223        6
  stories        0       0   0     313  190       28     33     33       2790       51
  set            0       0   0       0  237       26     29     11       1992       78
  creative       0       0   0       0    0       22      3      2        309       13
  worlds         0       0   0       0    0        0      8      4        261        5
  groups         0       0   0       0    0        0      0      2        114        4
  characters     0       0   0       0    0        0      0      0      15686      723
  fighting       0       0   0       0    0        0      0      0          0       54
[ reached max_feat ... 3,444 more features, reached max_nfeat ... 3,444 more features ]
Next, I find the top features of the co-occurrence matrix and plot them as a network.
# Pull the top 50 features
top_features <- names(topfeatures(text_fcm, 50))
# Retain only those top features in the matrix
even_text_fcm <- fcm_select(text_fcm, pattern = top_features, selection = "keep")
# Check dimensions
dim(even_text_fcm)
[1] 50 50
# Compute size weights for vertices in the network
size <- log(colSums(even_text_fcm))
# Create the plot
textplot_network(even_text_fcm, vertex_size = size / max(size) * 2)
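The observation below also draws on a word cloud of the document-feature matrix; the rendered section omits it, but per the post's source it is produced with:

textplot_wordcloud(text, min_size = 1.5, max_size = 4, random_order = TRUE,
                   max_words = 150, min_count = 50,
                   color = brewer.pal(8, "Dark2"))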
I can observe that "book" has the highest count, followed by words such as "great", "enjoyed", "amazing", "good", and "love", which express readers' feelings and will be useful for the sentiment analysis.
Further study
Next I will transform and categorize the data, produce further analysis plots, and, if possible, run a sentiment analysis.
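As a sketch of where that could start: quanteda.sentiment (loaded at the top, where it masks data_dictionary_LSD2015 from quanteda) scores polarity with the Lexicoder 2015 dictionary. This is one possible approach, not the post's method:

# Sketch only: polarity scores for the cleaned reviews via LSD2015
sentiments <- textstat_polarity(tokens(corpus(reviews$clean_text)),
                                dictionary = data_dictionary_LSD2015)
head(sentiments)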